feat: broadcast shutdown notification to active threads (RFC #78 §1d Phase 1)#182
ruan330 wants to merge 6 commits
Conversation
Review — PR #182

Good feature — users currently get no warning when the broker shuts down. This fixes that.

✅ What looks good

🔴 Must fix before merge
1. Rebase onto latest `main`
2. CI not triggered

🟡 Non-blocking
3. k8s `terminationGracePeriodSeconds`
4. "Restarting" message on permanent shutdown

Summary: Valuable UX improvement. Rebase needed — pool and main.rs have both changed significantly.
Force-pushed `04d1e0e` → `82c8a7e`
masami-agent
left a comment
PR Review: #182
Summary
- Problem: Broker restarts kill conversations silently — users get no notification that the broker is going down.
- Approach: Add an `active_thread_ids()` accessor to `SessionPool`, handle SIGTERM alongside SIGINT, and broadcast a neutral notification to each active Discord thread before shutting down shards.
- Risk level: Low
Core Assessment
- Problem clearly stated: ✅ — well-motivated, linked to RFC #78 §1d
- Approach appropriate: ✅ — minimal, clean separation between pool (lifecycle) and main (Discord)
- Alternatives considered: ✅ — PR description covers three alternatives with clear reasoning
- Best approach for now: ✅ — Phase 1 scope is correct; no over-engineering
Findings
pool.rs — active_thread_ids()
Clean and minimal. Takes a read lock, returns an owned Vec<String>, releases immediately. No contention risk with the subsequent shutdown() write lock. ✅
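The snapshot-under-read-lock pattern described above can be sketched as follows. This is a minimal std-only stand-in, not the PR's code: the real pool presumably uses tokio's async `RwLock`, and `AcpConnection` here is an empty placeholder type.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Placeholder for the real per-session ACP connection.
struct AcpConnection;

struct SessionPool {
    // thread_key ("platform:id") -> live connection
    active: RwLock<HashMap<String, AcpConnection>>,
}

impl SessionPool {
    /// Snapshot active session keys: take the read lock briefly and
    /// return an owned Vec, so the lock is released before any caller
    /// proceeds to take the shutdown write lock.
    fn active_thread_ids(&self) -> Vec<String> {
        self.active.read().unwrap().keys().cloned().collect()
    }
}
```

Returning an owned `Vec<String>` rather than a borrowed view is what makes the "no contention with the subsequent write lock" property hold: the guard is dropped inside the method.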
main.rs — SIGTERM handling
The tokio::select! over ctrl_c() and sigterm.recv() is the standard pattern. The .expect("install SIGTERM handler") is acceptable here — if we cannot install a signal handler at startup, panicking is the right call since the broker would be unable to shut down gracefully anyway. ✅
main.rs — broadcast loop
The strip_prefix("discord:") + parse::<u64>() approach correctly handles the {platform}:{id} key format used by AdapterRouter. Non-Discord keys (e.g. slack:...) are silently skipped via the parse failure path, which is the right behavior for now. ✅
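The key-handling behavior described here can be isolated as a small pure function. The function name `discord_channel_id` is illustrative; the PR inlines this logic in the broadcast loop.

```rust
/// Extract the numeric Discord channel id from a pool thread key of the
/// form "platform:id"; returns None for non-Discord keys.
fn discord_channel_id(thread_key: &str) -> Option<u64> {
    // Non-Discord keys (e.g. "slack:C123") are left unchanged by
    // strip_prefix and then fail the u64 parse, so they are skipped.
    let raw_id = thread_key.strip_prefix("discord:").unwrap_or(thread_key);
    raw_id.parse::<u64>().ok()
}
```

Note that a key like `slack:12345` is also skipped, because the unstripped `slack:` prefix makes the `u64` parse fail.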
main.rs — error handling
channel.say() failures are logged as warnings and do not block shutdown. Correct — a dead Discord connection should never prevent pool cleanup. ✅
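The log-and-continue behavior can be sketched as a std-only loop. `notify_all` and its `send` callback are hypothetical stand-ins; the PR sends via `channel.say()` and logs via `tracing::warn!`.

```rust
/// Best-effort broadcast: log each failure and keep going, never abort.
/// Returns the number of failed sends for observability.
fn notify_all(keys: &[&str], send: impl Fn(&str) -> Result<(), String>) -> usize {
    let mut failures = 0;
    for &key in keys {
        if let Err(e) = send(key) {
            // Log and continue: a dead connection must not block shutdown.
            eprintln!("failed to post shutdown notification to {key}: {e}");
            failures += 1;
        }
    }
    failures
}
```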
Review Summary
🔴 Blockers
`Cargo.lock` version bump is out of scope. The diff changes `Cargo.lock` from `0.7.6` → `0.7.7`. On `main`, `Cargo.toml` is already `0.7.7` but `Cargo.lock` is stale at `0.7.6`. While the sync is technically correct, it is unrelated to the shutdown broadcast feature and should not be bundled into this PR. Please revert the `Cargo.lock` change and let it be fixed in a separate housekeeping commit, or confirm this was an intentional rebase artifact that the maintainers are okay with.
💬 Questions
- Non-Discord shutdown path also lacks SIGTERM. The `else` branch (line ~194 on main: `"running without discord, press ctrl+c to stop"`) still only awaits `ctrl_c()`. In a Slack-only or headless deployment, `docker stop` / `kill` would still bypass the graceful path. Was this intentionally left out of scope, or should it be addressed here for consistency?
- Slack adapter threads. When Slack support is active, `thread_key` will be `slack:{id}`. The broadcast loop silently skips non-Discord keys: `strip_prefix("discord:")` leaves the key unchanged, then `parse::<u64>()` fails on `slack:...`. This is fine for Phase 1, but worth noting — should there be a `tracing::debug!` for skipped non-Discord keys so operators know the broadcast was Discord-only?
🔧 Suggested Changes
- Consider a timeout on the broadcast loop. If there are many active sessions and Discord is slow or rate-limiting, the broadcast could delay shutdown significantly. A `tokio::time::timeout` wrapping the entire loop (e.g. 5–10 seconds) would provide a safety net. Not blocking, but worth considering for production robustness.
- Pool shutdown path. At first glance `pool.shutdown()` looks like it could run twice on the Discord path, but it does not: after the spawned task runs `shard_manager.shutdown_all()`, execution continues past `client.start().await?` and hits the single `shutdown_pool.shutdown().await` at line ~208, while the broadcast task only broadcasts and shuts down shards. So the pool is shut down exactly once, which is correct. However, the broadcast task clones `pool` as `broadcast_pool` but never calls shutdown on it; since this is read-only usage, a name like `notification_pool` or just `pool_ref` would signal that more clearly.
ℹ️ Info
- The `Cargo.lock` discrepancy on `main` (`Cargo.toml` = 0.7.7, `Cargo.lock` = 0.7.6) is a pre-existing issue. It should be fixed, but in a separate commit.
- RFC #78 §1d Phase 2 (session persistence) is correctly deferred. The neutral wording ("you can continue" vs "will resume") is the right call given no state is persisted yet.
⚪ Nits
- Line length on the `channel.say()` string literal is ~90 chars — fine, but if you want to break it for readability, a `const` at the top of the block would work nicely: `const SHUTDOWN_MSG: &str = "🔄 Broker restarting. You can continue the conversation when the broker is back.";`
Verdict
COMMENT — One blocker (Cargo.lock scope) and two questions to resolve. The core implementation is clean and well-designed. Once the Cargo.lock issue is addressed and the questions are answered, this should be ready for approval.
masami-agent
left a comment
Hi @ruan330 — leaving some inline comments on specific areas. The overall review is in my earlier comment above. Thanks for the solid work on this!
```diff
 [[package]]
 name = "openab"
-version = "0.7.6"
+version = "0.7.7"
```
🔴 BLOCKER — This version bump (0.7.6 → 0.7.7) is a pre-existing Cargo.lock / Cargo.toml sync issue on main, not related to the shutdown broadcast feature. Please revert this file from the PR to keep the scope clean. The lockfile sync can be fixed in a separate housekeeping PR or commit:

```shell
git checkout origin/main -- Cargo.lock
```

```rust
// systemctl stop / docker stop / kill deliver SIGTERM; without handling
// it, the broker would exit without running the broadcast below.
let shard_manager = client.shard_manager.clone();
let broadcast_pool = pool.clone();
```
💬 Question — The else branch below (non-Discord / Slack-only mode, around line 194 on main) still only awaits ctrl_c(). In a Slack-only or headless deployment, docker stop / kill would still bypass the graceful path. Was this intentionally left out of scope for this PR? If so, might be worth a // TODO: comment noting that SIGTERM handling should be added here too.
```rust
info!(count = thread_ids.len(), "broadcasting shutdown notification");
for thread_key in thread_ids {
    // thread_key format: "platform:id" (e.g. "discord:1234567890").
    let raw_id = thread_key
```
🟡 Suggestion — Consider wrapping the broadcast loop in a timeout to guard against Discord rate limits or network issues delaying shutdown:

```rust
let _ = tokio::time::timeout(
    std::time::Duration::from_secs(10),
    async {
        for thread_key in thread_ids {
            // ... existing broadcast logic ...
        }
    },
).await;
```

Not blocking — just a safety net for production. If Discord is slow, you probably still want the process to exit within a bounded time.
```rust
    .strip_prefix("discord:")
    .unwrap_or(&thread_key);
if let Ok(id) = raw_id.parse::<u64>() {
    let channel = serenity::model::id::ChannelId::new(id);
```
ℹ️ Info — When Slack adapter threads exist, their keys (`slack:{id}`) will pass through `strip_prefix("discord:")` unchanged (returns `None` → `unwrap_or` gives the full key), then `parse::<u64>()` fails on `slack:12345` and they're silently skipped. This is correct behavior for Phase 1, but a `tracing::debug!` here for skipped non-Discord keys would help operators understand why some sessions didn't get notifications:

```rust
let raw_id = thread_key
    .strip_prefix("discord:")
    .unwrap_or(&thread_key);
if let Ok(id) = raw_id.parse::<u64>() {
    // ... send notification ...
} else {
    tracing::debug!(thread_key = %thread_key, "skipping non-Discord thread for shutdown broadcast");
}
```

```rust
    )
    .await
{
    tracing::warn!(thread_key = %thread_key, error = %e, "failed to post shutdown notification");
```
⚪ Nit — The message string is a bit long inline. Consider extracting it as a constant for readability:

```rust
const SHUTDOWN_MSG: &str =
    "🔄 Broker restarting. You can continue the conversation when the broker is back.";
```

Minor — skip if you prefer it inline.
OpenAB PR Screening
Screening report

## Intent

This PR tries to make broker shutdowns visible to users who are actively chatting in Discord threads. Today, a broker restart or stop can terminate an in-flight conversation without explanation, leaving a partial reply or a dead thread with no operator-visible or user-visible signal. The proposed change addresses that gap by sending a shutdown notice to all currently active Discord threads before the broker tears down the session pool.

## Feat

This is a feature with some operational-hardening characteristics. In plain terms, when the OpenAB broker receives a shutdown signal (SIGINT or SIGTERM), it notifies active threads before exiting.

## Who It Serves

The primary beneficiary is Discord end users in active threads, with secondary value for maintainers and operators who want shutdown behavior to be less opaque and easier to reason about during deploys, restarts, or crashes followed by restart.

## Rewritten Prompt

Implement Phase 1 of graceful shutdown notification for the Discord broker. When the process receives either SIGINT or SIGTERM, broadcast a short notice to every active thread before teardown.

## Merge Pitch

This is worth advancing because it closes a user-facing failure mode with low implementation surface area: silent thread death during broker restarts. The change is operationally useful, narrow in scope, and aligned with RFC #78 Phase 1. The risk profile is moderate-low. The likely reviewer concern is not whether the feature is useful, but whether shutdown-path messaging is too coupled to Discord delivery, too optimistic under rate limits, or under-tested for edge cases like invalid thread IDs and partial shutdown failures.

## Best-Practice Comparison

OpenClaw principles that are relevant here:

Hermes Agent principles that are relevant here:

Net assessment:

## Implementation Options

### Option 1: Minimal signal-path broadcast

Keep the current approach: handle SIGTERM alongside SIGINT and broadcast from the signal path. This is the most conservative path. It preserves existing boundaries reasonably well and keeps the PR tightly scoped to user notification.

### Option 2: Lifecycle notifier abstraction

Introduce a small shutdown notifier component or trait owned by the broker layer. The pool exposes active session identifiers, while the broker invokes a notifier responsible for formatting and delivering shutdown notices. The initial implementation still targets Discord threads, but the shutdown path becomes easier to test and less ad hoc. This is the balanced option. It adds a small abstraction without dragging Discord concerns into the pool.

### Option 3: Broader graceful-shutdown framework

Implement a more complete shutdown coordinator: optional drain window for in-flight streams, bounded concurrency for thread notifications, structured shutdown result logging, and groundwork for Phase 2 persistence or resumable session metadata. This is the ambitious option. It better matches mature gateway/runtime patterns, but it expands scope and risks turning a clean Phase 1 merge into a broader design exercise.

### Comparison Table

## Recommendation

Recommend the balanced version of Option 1 leaning slightly toward Option 2: merge the shutdown notification behavior now, but keep the design disciplined enough that notification delivery is clearly broker-owned and easy to evolve. That is the right step for follow-up discussion because it solves the real user problem immediately without pretending to solve persistence or resumability. If reviewers are comfortable with the current narrow accessor plus `main.rs`-level broadcast, Option 1 is acceptable for Phase 1.
Force-pushed `82c8a7e` → `fa106cd`
Force-pushed.

Mapping to the screening concerns:

Design alternatives considered (before picking pool-owned addressing):

Tests: 43 passed, clippy clean. CI green.
Force-pushed `fa106cd` → `12aa620`
Force-pushed.

Tests: 43 passed, clippy clean. CI green.
Small maintainer note: the earlier CI failure was not a design objection to the shutdown feature itself. The real issue was branch drift against current `main`.
Handoff note: the earlier CI failures were merge-drift noise, not a feature-design rejection. We rebased/fixed the branch against current `main`.
masami-agent
left a comment
PR Review: #182
Summary
- Problem: When the broker shuts down (SIGINT/SIGTERM), active conversations die silently mid-reply with no user-facing signal. RFC #78 §1d Phase 1.
- Approach: Pool-owned addressing + atomic shutdown flag + parallel broadcast via `JoinSet` with timeout, platform-neutral through the `ChatAdapter` abstraction.
- Risk level: Medium — touches core pool admission logic and shutdown path.
Core Assessment
- Problem clearly stated: ✅ — Excellent PR description with design rationale, behavior notes, and test plan.
- Approach appropriate: ✅ — Pool-owns-addressing is the right call. No parallel cache in the adapter layer, single source of truth.
- Alternatives considered: ✅ — PR description explicitly contrasts with the original Discord-specific approach and explains why the multi-platform refactor makes this the right home.
- Best approach for now: ✅ — Minimal, well-scoped. Deliberately avoids session persistence (Phase 2 follow-up).
Findings
Shutdown admission invariant (pool.rs) — The 4-point check in `get_or_create` is thorough and correct:

- Fast-fail on entry (cheap, avoids spawning ACP process).
- Inside the `state.read()` block (catches shutdowns before this task ran).
- After `conn.lock()` with a `state.read()` barrier (synchronizes with `begin_shutdown`'s write-lock flag flip — this is the subtle one, and the comment explains it well).
- Inside the final `state.write()` before insert (atomic with flag-flip + snapshot).
The begin_shutdown() method correctly holds the state write lock while flipping the AtomicBool and snapshotting addresses, so any admission that committed before the lock is in the snapshot, and any that comes after sees the flag. Sound.
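The flag-flip-plus-snapshot critical section can be sketched with std primitives standing in for the pool's async locks. All names here are illustrative; the real `begin_shutdown` is async and snapshots `(ChannelRef, Adapter)` pairs rather than strings.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

// Stand-in for the stored (ChannelRef, Adapter) addressing pair.
type Address = String;

struct SessionPool {
    shutting_down: AtomicBool,
    addresses: RwLock<HashMap<String, Address>>,
}

impl SessionPool {
    /// Flip the shutdown flag and snapshot addresses in ONE critical
    /// section: any admission that committed before we took the write
    /// lock is in the snapshot; any admission after it sees the flag.
    fn begin_shutdown(&self) -> Vec<(String, Address)> {
        let guard = self.addresses.write().unwrap();
        self.shutting_down.store(true, Ordering::Release);
        guard.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }

    /// Fast-fail check used at admission points.
    fn is_shutting_down(&self) -> bool {
        self.shutting_down.load(Ordering::Acquire)
    }
}
```

Holding the write lock across both the flag store and the snapshot is what makes "in the snapshot XOR rejected at admission" an invariant rather than a race.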
addresses invariant — addresses.keys() == active.keys() is maintained across all mutation paths: insert, stale rebuild, eviction, cleanup_idle, and shutdown. I traced each one — all correct.
Broadcast timeout (adapter.rs) — The biased select with set.shutdown().await is correct. set.shutdown() cancels remaining tasks and waits for them to drop, which is the right behavior for HTTP calls that should be cancellable via tokio's cooperative cancellation.
Bot-loop prevention (adapter.rs) — Smart catch. During the shutdown window, the bot's own broadcast message could arrive as a new event on Slack, trigger handle_message, get rejected by the pool, and the rejection reply would loop. Deserializing sender_json to check is_bot breaks the cycle cleanly. The Deserialize derive on SenderContext is the minimal change needed.
Signal handling (main.rs) — wait_for_shutdown_signal() with #[cfg(unix)] SIGTERM + ctrl_c, and a ctrl_c-only fallback for non-Unix, is clean and correct. Both the Discord and Slack-only branches use the same flow.
discord.rs / slack.rs — Purely rustfmt line wrapping. No logic changes. ✅
Review Summary
🔴 Blockers
(none)
💬 Questions
- `get_or_create` signature change — The new `channel: &ChannelRef, adapter: &Arc<dyn ChatAdapter>` parameters are passed through from `handle_message`. Is there any other call site for `get_or_create` outside of `handle_message`? If so, those callers would need updating. (I only see the one call site in `adapter.rs`, so this should be fine — just confirming.)
🔧 Suggested Changes
- Commit history cleanup — The branch has merge commits from "Codex Temp" (`48034166`, `9e7deaa9`, `13563f9c`). Since the project uses squash merge, this will be cleaned up at merge time, but if you're planning any further pushes to this branch, consider rebasing to a single commit for a cleaner `git log` during review.
ℹ️ Info
- The `_sync = self.state.read().await` pattern in check #3 is unusual (acquiring a read lock purely for synchronization), but the comment explains the intent clearly. The RwLock acquire provides the happens-before relationship needed to make the `AtomicBool` store from `begin_shutdown` visible. This is a valid and well-documented pattern here.
- `broadcast_shutdown` sends via `adapter.send_message()` (Discord/Slack HTTP API), not through ACP connections, so `pool.shutdown()` clearing `active` connections afterward does not interfere with in-flight broadcast sends. The sequencing in `main.rs` (broadcast → shutdown shards → `pool.shutdown()`) is correct.
- The `AtomicBool` with Acquire/Release ordering is technically redundant with the RwLock-protected checks in points #2 and #4 (which already have happens-before from the lock), but it's needed for the fast-fail path (#1) and the read-barrier path (#3). Belt-and-suspenders for shutdown correctness is the right trade-off.
⚪ Nits
(none — code is clean, comments are thorough)
Verdict
APPROVE — This is a well-designed, well-documented feature. The shutdown admission invariant is sound, the race condition coverage is thorough, and the platform-neutral broadcast approach is the right architecture. No blockers.
obrutjack
left a comment
Merge checklist verified:
- ✅ CI all green (cargo check + 7 Docker smoke tests)
- ✅ masami-agent reviewed — no blockers, thorough analysis of shutdown admission invariant
- ✅ Multi-platform broadcast via ChatAdapter (Discord + Slack)
- ✅ Parallel broadcast with 10s timeout (JoinSet + deadline)
- ✅ SIGTERM + SIGINT handling (unix + non-unix)
- ✅ Bot-loop prevention during shutdown window
- ✅ Pool admission 4-layer shutdown check — sound
- ✅ Neutral wording (no "restarting" assumption)
- ✅ No version regression
Well-designed feature. The pool-owned addressing approach is the right architecture. Pending @thepagent.
…penabdev#78 §1d Phase 1)

On SIGINT or SIGTERM the broker now posts a short notification to every active thread across all configured adapters (Discord, Slack, …) before closing shards. Users get a clear signal that the broker is going away instead of replies cutting off silently mid-stream.

Discord Discussion URL: https://discord.com/channels/1488041051187974246/1495050997461024879

- `SessionPool` stores each active session's addressing info alongside its connection (`addresses: HashMap<thread_key, (ChannelRef, Adapter)>`), kept in lockstep with `state.active` so broadcast has a single source of truth. No parallel cache in the adapter layer.
- `SessionPool::begin_shutdown()` flips the shutdown flag and snapshots `addresses` under the state write lock, making admission and snapshot atomic with respect to each other.
- `AdapterRouter::broadcast_shutdown(message, timeout)` calls `begin_shutdown`, then posts the notification to every snapshot entry in parallel via `tokio::task::JoinSet`. A `tokio::select!` with a 10-second deadline caps total broadcast time so shutdown itself never blocks on a slow platform.
- `main.rs` shutdown handler listens for SIGTERM in addition to SIGINT (`#[cfg(unix)]`-gated with a ctrl_c-only fallback for non-Unix), and runs `broadcast_shutdown` before `shard_manager.shutdown_all()`. Both the Discord branch and the Slack-only branch use the same flow via a shared `wait_for_shutdown_signal()` helper.

Every session in the pool at broadcast time is either:

1. In the broadcast snapshot (receives the notification), or
2. Rejected at admission (receives an inline "Bot is shutting down" notice through `handle_message`'s pool-error path).

`get_or_create` checks `is_shutting_down()` at four atomic points:

1. Fast-fail on entry (cheap).
2. Inside the initial `state.read()` block (catches shutdowns that ran before we reached this task).
3. Before the `return Ok(())` on the existing-alive-session path, with a `state.read()` barrier to synchronize with `begin_shutdown`'s `state.write()` flag flip (catches shutdowns that started while we were waiting on the per-connection mutex).
4. Inside the final `state.write()` block before insert (atomic with `begin_shutdown`'s flag flip + snapshot).

- Notification is best-effort: delivery errors and the 10-second deadline both fall through to normal pool teardown.
- Wording is intentionally neutral ("Context will reset on return") — we don't promise automatic session resume; RFC openabdev#78 Phase 2 persistence is a separate follow-up.
- Per-channel parallel send is safe under Discord/Slack rate limits (limits are per-channel, not per-adapter) and is ~10× faster than sequential for deployments with several active threads.
- `cargo test` passes: 43 passed, 0 failed.
- `cargo clippy --all-targets -- -D warnings` clean.
- Verified on bare-metal and Docker brokers: SIGTERM triggers the `broadcasting shutdown notification count=N` log and the message arrives in every active thread.

Relates to: openabdev#78, openabdev#75
Rebased onto current `main`.

Summary of conflict resolution:

Tests:

New head SHAs:

Thanks for your patience with the rebase churn — happy to adjust further if reviewers want a different shape.
Force-pushed `13563f9` → `44a9cdd`
四法師 Collaborative Triage Review — PR #182 (2026-05-01)

Verdict: CHANGES_REQUESTED — Design is sound and well-documented. Blocking on rebase against current `main`.

🟢 INFO — What looks good

All four monks agree the core design is correct:

🟡 NIT — Non-blocking suggestions

🔴 SUGGESTED CHANGES — Blocking

1. Rebase Required (all four monks). PR merge state is DIRTY.
2. Rebase Risk: Session Persistence Regression (擺渡). Critical finding: current `main` has moved since this branch was cut. If the rebase naively takes the PR version of the conflicting files, fixes now on `main` could be lost.
3. Rebase Risk: main.rs Feature Regression (擺渡). Current `main`'s main.rs already includes functionality (e.g. `shutdown_signal()`) that a naive rebase could drop.

📋 Review Participants

Next step for contributor @ruan330: Please rebase onto current `main`.

Follow-up: Precise Rebase Guidance (四法師 supplemental)

普渡法師 verified current `main`.

Rebase checklist for @ruan330

Once rebased and conflicts are resolved, re-request review.
chaodu-agent
left a comment
PR Review: #182 — Broadcast Shutdown Notification
CHANGES REQUESTED — Solid design with atomic admission + snapshot, but the PR has merge conflicts (mergeStateStatus: DIRTY) that must be resolved before merge. The main branch has evolved significantly since this PR was opened.
Baseline Check (Step 0)
Main branch already has:
- `shutdown_signal()` in `main.rs` — handles SIGINT + SIGTERM with `#[cfg(unix)]` gating
- `pool.shutdown()` — clears active sessions on exit
- `router` is `Arc<AdapterRouter>`, already cloned across Discord/Slack/Gateway/Cron tasks
- `get_or_create(&self, thread_id: &str)` — single-arg signature
- `PoolState` uses `active: HashMap<String, Arc<Mutex<AcpConnection>>>` (flat map)
Main does NOT have:
- Any broadcast notification to active threads on shutdown
- `SessionEntry` struct or `ChannelRef`/`ChatAdapter` stored in pool
- `begin_shutdown()` / `is_shutting_down()` on `SessionPool`
- `broadcast_shutdown()` on `AdapterRouter`
Verdict: Feature is entirely net-new. No stale/duplicate content.
Four Questions
1. What problem does it solve?
When the broker shuts down (SIGINT/SIGTERM, docker stop, helm uninstall), active conversations die silently. Users see replies cut off mid-stream with no explanation. RFC #78 Phase 1d calls this out.
2. How does it solve it?
- Pool owns addressing: `SessionPool` stores `ChannelRef` + `Arc<dyn ChatAdapter>` alongside each connection in a new `SessionEntry` struct, keeping them in lockstep with `sessions` (renamed from `active`).
- Atomic admission + snapshot: `begin_shutdown()` takes the state write lock, flips an `AtomicBool` flag, and snapshots all session addresses in one critical section.
- Four-point admission guard: `get_or_create` checks `is_shutting_down()` at entry, inside `state.read()`, after the per-connection mutex (with a read-lock barrier), and inside the final `state.write()` before insert.
- Platform-neutral broadcast: `AdapterRouter::broadcast_shutdown()` sends to all snapshot entries in parallel via `JoinSet`, capped by a 10s timeout.
- Unified signal handling: `wait_for_shutdown_signal()` handles both SIGINT and SIGTERM on Unix.
3. What was considered?
- This is a rewrite of the original Discord-specific approach against the multi-platform `AdapterRouter`/`ChatAdapter` abstraction from #259.
- Wording is intentionally neutral ("Context will reset on return") — no promise of session resume (that is RFC #78 Phase 2).
- Best-effort delivery: send errors and the 10s deadline both fall through to normal teardown.
4. Is this the best approach?
Strengths:
- The atomic admission + snapshot design is well-reasoned. The four-point shutdown check in `get_or_create` is thorough and each check point is justified in comments.
- Pool-owns-addressing is cleaner than a parallel cache in the adapter layer.
- `JoinSet` + `tokio::select!` with deadline is the right pattern for bounded parallel broadcast.
- The bot-loop prevention (skipping shutdown rejection for bot-authored events) is a nice edge case catch.
Concerns listed in Traffic Light below.
Traffic Light Summary
- 🔴 MERGE CONFLICTS — PR is in DIRTY merge state. Main has evolved (e.g., `shutdown_signal()` already exists, `PoolState` structure may have changed). Contributor must rebase onto current main and resolve conflicts.
- 🔴 Verify `is_bot` field — The shutdown-rejection path deserializes `SenderContext` and reads `.is_bot`. Confirm this field exists in the current `SenderContext` struct on main. If not, it needs to be added or the deserialization will silently default.
- 🟡 Consolidate signal handlers — Main already has `shutdown_signal()`. The PR adds `wait_for_shutdown_signal()` with identical logic. After rebase, consolidate to one.
- 🟡 `_shutdown_barrier` fragility — Consider extracting the read-lock-as-barrier pattern into a named helper method to prevent accidental removal.
- 🟢 Atomic admission design — The four-point shutdown check with justified comments at each point is excellent defensive programming.
- 🟢 Bot-loop prevention — Smart edge case handling for Slack where the bot's own broadcast could trigger a re-entry.
- 🟢 Timeout-bounded broadcast — `JoinSet` + `tokio::select!` with 10s deadline ensures shutdown is never blocked by a slow platform.
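The named-helper extraction suggested for the barrier might look like the following std-only sketch. `PoolState`, the field names, and `shutdown_read_barrier` are all illustrative; the real method would be async over tokio's `RwLock`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

// Illustrative stand-in for the pool's shared state.
struct PoolState;

struct SessionPool {
    shutting_down: AtomicBool,
    state: RwLock<PoolState>,
}

impl SessionPool {
    /// Named replacement for the bare `let _sync = state.read().await;`
    /// barrier: acquiring and dropping the state read lock establishes
    /// a happens-before edge with begin_shutdown's write-lock section,
    /// making its shutdown-flag store visible to this thread.
    fn shutdown_read_barrier(&self) {
        let _barrier = self.state.read().unwrap();
    }

    /// Barrier-synchronized flag check (sketch; the PR also has a
    /// cheaper barrier-free fast-fail check at admission entry).
    fn is_shutting_down(&self) -> bool {
        self.shutdown_read_barrier();
        self.shutting_down.load(Ordering::Acquire)
    }
}
```

Giving the barrier a name and a doc comment makes the intent survive refactors, which is exactly the fragility the 🟡 item above flags.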
CI is all green (cargo check + 7 Docker smoke tests pass), but the merge conflicts must be resolved first. Previous reviews: masami-agent commented (no blockers), shaun-agent approved, obrutjack's approval was dismissed (likely due to new commits).
Action needed: Contributor @ruan330 please rebase onto current main and resolve conflicts. The shutdown_signal() function already exists on main — consolidate with your wait_for_shutdown_signal() during rebase.
Not planned for now.
Summary
On SIGINT or SIGTERM the broker now posts a short notification to every active thread across all configured adapters (Discord, Slack, …) before closing shards. Users get a clear signal that the broker is going away instead of replies cutting off silently mid-stream.
Discord Discussion URL: https://discord.com/channels/1488041051187974246/1495050997461024879
Relates to: #78, #75
Problem
Today, when the broker is restarted (`systemctl restart`, `docker stop`, Ctrl+C), conversations just die silently mid-reply. Users have no signal that anything is happening until the broker comes back, and in-flight streams leave partial messages with no explanation. RFC #78 §1d calls this out as Phase 1.

The original #182 (rebased before v0.7.6) was Discord-specific and done at the `main.rs` level. The multi-platform refactor in #259 makes that approach stale: Slack threads also need the notification, and the `AdapterRouter`/`ChatAdapter` abstraction is the right home for platform-neutral broadcast. This revision rewrites the feature against that architecture.

Design

**Pool owns addressing, not the router.** `SessionPool` now stores each active session's `ChannelRef` + `Arc<dyn ChatAdapter>` alongside the connection (`addresses: HashMap<thread_key, ...>`), kept in lockstep with `state.active` so broadcast has a single source of truth. No parallel cache in the adapter layer, no `pool.active_thread_ids()` filter, no prune task.

**Atomic admission and snapshot.** `SessionPool::begin_shutdown()` takes the state write lock, flips the shutdown flag, and snapshots `addresses` — all in one critical section. Any admission committed before us is in the snapshot. `get_or_create` checks `is_shutting_down()` at four points so every session is either in the snapshot or rejected at admission:

1. Fast-fail on entry.
2. Inside the initial `state.read()` block.
3. Before the `return Ok(())` on the existing-alive-session path, with a `state.read()` barrier synchronizing against `begin_shutdown`'s `state.write()` flag flip (closes the race where shutdown starts while we wait on the per-connection mutex).
4. Inside the final `state.write()` block before insert.

**Platform-neutral broadcast.** `AdapterRouter::broadcast_shutdown(message, timeout)` calls `pool.begin_shutdown()` and posts the notification to every snapshot entry in parallel via `tokio::task::JoinSet`. A `tokio::select!` with a configurable deadline (10s here) caps total broadcast time so shutdown itself never blocks on a slow platform.

**Unified signal handling.** `main.rs` uses a shared `wait_for_shutdown_signal()` helper for SIGINT + SIGTERM (`#[cfg(unix)]`-gated with a ctrl_c-only fallback for non-Unix). Both the Discord-enabled and Slack-only branches invoke `broadcast_shutdown` before tearing down adapters.

Behaviour notes
Tests & verification
- `cargo test` — 43 passed, 0 failed.
- `cargo clippy --all-targets -- -D warnings` — clean.
- Verified on bare-metal and Docker brokers: SIGTERM triggers the `broadcasting shutdown notification count=N` log and the message arrives in every active thread.

Test plan

- `cargo test`, `cargo clippy`, Docker smoke tests.
- With an active thread, send SIGTERM (`docker stop` / `kill -TERM`) and verify the thread receives the "Bot is shutting down" message before the pool tears down.